Abstract. The past decade has witnessed many interesting algorithms for maintaining
statistics over a data stream. This paper initiates a theoretical study of algorithms for
monitoring distributed data streams over a time-based sliding window (which contains a
variable number of items and possibly out-of-order items). The concern is how to minimize
the communication between individual streams and the root, while allowing the root,
at any time, to report the global statistics of all streams within a given error
bound. This paper presents communication-efficient algorithms for three classical statistics,
namely, basic counting, frequent items and quantiles. The worst-case communication
cost over a window is $O(\frac{k}{\varepsilon}\log\frac{\varepsilon N}{k})$ bits for basic counting and $O(\frac{k}{\varepsilon}\log\frac{N}{k})$ words for the
remaining ones, where k is the number of distributed data streams, N is the total number of
items in the streams that arrive or expire in the window, and ε < 1 is the desired error
bound. Matching and nearly matching lower bounds are also obtained.



1. Introduction 

The problems studied in this paper are best illustrated by the following puzzle. John
and Mary work in different laboratories and communicate by telephone only. In a
forever-running experiment, John records which devices have an exceptional signal every 10
seconds. To adjust her devices, Mary at any time needs to keep track of the number of
exceptional signals generated by each device of John in the last hour. John can call
Mary every 10 seconds to report the exceptional signals, yet this requires too many calls in
an hour and the total message size per hour is linear in the total number N of exceptional
signals in an hour. Mary's devices actually allow some small error. Can the number of
calls and message size be reduced to o(N), or even poly-log N, if a small error (say, 0.1%) is
allowed? It is important to note that the input is given online and Mary needs to know the
answers continuously; this makes our problem different from those in other similar classical
models, such as the Simultaneous Communication Complexity model [3], in which all inputs
are given in advance and the parties need to compute an answer only once.

Motivation. The above problem appears in data stream applications, e.g., network 
monitoring or stock analysis. In the last decade, algorithms for continuous monitoring of a 
single massive data stream gained a lot of attention (see [1, 26] for a survey), and the main
challenge has been how to represent the massive data using limited space, while allowing 
certain statistics (e.g., item counts, quantiles) to be computed with sufficient accuracy. 

The space-accuracy tradeoff for representing a single stream has gradually been
understood over the years (e.g., [2, 15, 18, 19]). Recently, motivated by large scale networks,
the database community is enthusiastic about communication-efficient algorithms for con- 
tinuous monitoring of multiple, distributed data streams. In such applications, we have 
k > 1 remote sites each monitoring a data stream, and there is a root (or coordinator) 
responsible for computing some global statistics. A remote site needs to maintain cer- 
tain statistics itself, and has to communicate with the root often enough so that the root 
can compute, at any time, the statistics of the union of all data streams within a certain 
error. The objective is to minimize the communication. The communication aspects of 
data streams introduce several challenging theoretical questions such as what is the opti- 
mal communication-accuracy tradeoff for maintaining a particular statistic, and whether 
two-way communication is inherently more efficient than one-way communication. 

Data stream models and ε-approximate queries. The data stream at each remote
site is a sequence of items from a totally ordered set U. Each item is associated with an
integral time-stamp recording its arrival time. Each remote site has limited space and hence
it can only maintain the required statistics approximately. The statistics can be based on
the whole data stream [2, 15, 18, 19] or only the recent items [3j[lH[22]. Recent items can
be modeled by two types of sliding windows [5, 13]. Let W be the window size, which is
a positive integer. The count-based sliding window includes the last W items in the data
stream, while the time-based sliding window includes items whose time-stamps are within
the last W time units. The latter assumes that zero or more items can arrive at a time.
Items in a sliding window will expire and are more difficult to handle than in the whole
data stream. For example, counting the frequency of a certain item in the whole stream
can be done easily by maintaining a single counter, yet the same problem requires space
$\Omega(\frac{1}{\varepsilon}\log^2(\varepsilon W))$ bits for a count-based sliding window even if we allow a relative error of at
most ε [13, 16]. In fact, the whole data stream model can be viewed as a special case of
the sliding window model with window size being infinite. Also, a count-based window is a
special case of a time-based window in which exactly one item arrives at a time. This paper
focuses on the time-based window, and the algorithms are applicable to the other two models.

We study algorithms that enable the root to answer three types of classical ε-approximate
queries, defined as follows. Let 0 < ε < 1. For any stream σ, let $c_{j,\sigma}$ and $c_\sigma$ be the count
of item j and all items whose timestamps are in the current window, respectively. Denote
$c_j = \sum_\sigma c_{j,\sigma}$ and $c = \sum_\sigma c_\sigma$ as the total count of item j and all items in all the data
streams, respectively.

• Basic Counting. Return an estimate $\hat{c}$ on the total count c such that $|\hat{c} - c| \le \varepsilon c$.
(Note that this query can be generalized to count data items of a fixed subset $X \subseteq U$;
the literature often refers to the special case with U = {0, 1} and X = {1}.)
• Frequent Items. Given any $0 < \phi \le 1$, return a set $F \subseteq U$ which includes all items
j with $c_j \ge \phi c$ and possibly some items j' with $c_{j'} \ge \phi c - \varepsilon c$.

• Quantiles. Given any $0 < \phi \le 1$, return an item whose rank is in $[\phi c - \varepsilon c, \phi c + \varepsilon c]$
among the c items in the current sliding window.

As in most previous works, we need to answer the following type of ε-approximate queries
in order to answer queries on frequent items.

• Approximate Counting. Given any item j, return an estimate $\hat{c}_j$ such that $|\hat{c}_j - c_j| \le \varepsilon c$.
(Note that this query gives an estimate for any item, not just the frequent items.
Also, the error bound is in terms of c, which may be much larger than $c_j$.)
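The four query definitions above can be read as acceptance tests on an answer. The following Python sketch checks a candidate answer against exact counts computed from the full window contents; it is only an illustration of the definitions, not part of the paper's algorithms. The function names are hypothetical, and treating an item's rank as the range of positions it occupies in sorted order is an assumption made here to handle duplicates.

```python
import bisect
from collections import Counter

def exact_counts(streams):
    """Exact per-item counts c_j and total count c over all streams' windows."""
    per_item = Counter()
    for items in streams:
        per_item.update(items)
    return per_item, sum(per_item.values())

def basic_count_ok(estimate, streams, eps):
    """Basic counting: |estimate - c| <= eps * c."""
    _, c = exact_counts(streams)
    return abs(estimate - c) <= eps * c

def approx_count_ok(estimate, j, streams, eps):
    """Approximate counting: |estimate - c_j| <= eps * c (error relative to c, not c_j)."""
    c_j, c = exact_counts(streams)
    return abs(estimate - c_j[j]) <= eps * c

def frequent_items_ok(answer, streams, phi, eps):
    """Frequent items: all j with c_j >= phi*c, and nothing below phi*c - eps*c."""
    c_j, c = exact_counts(streams)
    must_have = {j for j, cnt in c_j.items() if cnt >= phi * c}
    allowed = {j for j, cnt in c_j.items() if cnt >= phi * c - eps * c}
    return must_have <= set(answer) <= allowed

def quantile_ok(item, streams, phi, eps):
    """Quantiles: some rank of `item` lies in [phi*c - eps*c, phi*c + eps*c]."""
    merged = sorted(x for items in streams for x in items)
    c = len(merged)
    lo = bisect.bisect_left(merged, item) + 1   # smallest 1-based rank of item
    hi = bisect.bisect_right(merged, item)      # largest 1-based rank of item
    return hi >= lo and lo <= phi * c + eps * c and hi >= phi * c - eps * c
```

With two toy windows `[[1, 1, 2, 3], [1, 2, 2, 4]]`, items 1 and 2 each have count 3 out of c = 8, so for φ = 0.3 and ε = 0.1 the only acceptable frequent-items answer is {1, 2}.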

We need an algorithm to determine when and how the remote sites communicate with 
the root so that the root can answer the queries at any time. The objective is to minimize 
the worst-case communication cost within a window of W time units. 

Previous works. Recently, the database literature has a flurry of results on continuous 
monitoring of distributed data streams, e.g. [HIEIEIIISIIITIISOKMKISKSZIIIH] . The algorithms 
studied can be classified into two types: one-way algorithms only allow messages sent from 
each remote site to the root, and two-way algorithms allow bi-directional communication 
between the root and each site. One-way algorithms are often very simple as a remote 
site has little information and all it can do is to update the root when its local statistics 
deviate significantly from those previously sent. On the other hand, most two-way algo- 
rithms are complicated and often involve non-trivial heuristics. It is commonly believed 
in the database community that two-way algorithms are more efficient; however, for most 
existing two-way algorithms, their worst-case communication costs are still waiting for rig- 
orous mathematical analysis, and existing works often rely on experimental results when 
evaluating the communication cost. 

The literature contains several results on the mathematical analysis of the worst-case
performance of one-way algorithms. They are all for the whole data stream setting.
Keralapura et al. [21] studied the thresholded-count problem, which leads to an algorithm for
basic counting with communication cost $O(\frac{k}{\varepsilon}\log\frac{N}{k})$ words, where k and N are the number
of streams and the number of items in these streams, respectively. Cormode et al. [9] gave
an algorithm for quantiles with communication cost $O(\frac{1}{\varepsilon^2}\log\frac{N}{k})$ words per stream. They
also showed how to handle frequent items via a reduction to quantiles, so the communication
cost remains the same. More recently, Yi and Zhang [29] have reduced the communication
cost for frequent items to $O(\frac{k}{\varepsilon}\log\frac{N}{k})$ words, and quantiles to $O(\frac{k}{\varepsilon}\log^2(\frac{1}{\varepsilon})\log\frac{N}{k})$ words, using
some two-way algorithms; these are the only analyses for two-way algorithms so far.

There have been attempts to devise heuristics to extend some whole-data-stream al- 
gorithms to sliding windows, yet not much has been known about their worst-case perfor- 
mance. For example, Cormode et al. [9] have extended their algorithms for quantiles and 
frequent items to sliding windows. They believed that the communication cost would only 
have a mild increase, but no supporting analysis has been given. The analysis of sliding- 
window algorithms is more difficult because the expiry of items destroys some monotonic 
property that is important to the analysis for whole data stream. In fact, finding sliding- 
window algorithms with efficient worst-case communication has been posed as an open 
problem in the latest work of Yi and Zhang [29].

Our results. This paper gives the first mathematical analysis of the communication
cost in the sliding window model. We derive lower bounds on the worst-case communication
cost of any two-way algorithm (and hence any one-way algorithm) for answering the four
types of ε-approximate queries. These lower bounds hold even when each remote site has
unlimited space to maintain the local statistics exactly. More interestingly, we analyze some
common-sense algorithms that use one-way communication only and prove that their
communication costs match or nearly match the corresponding lower bounds. In our algorithms,
each remote site only needs to maintain some O(ε)-approximate statistics for its local data,
which actually adds more complication to the problem. These results demonstrate optimal
or near optimal communication-accuracy tradeoffs for supporting these queries over the
sliding window. Our work reveals that two-way algorithms could not be much better than
one-way algorithms in the worst case.

[Table 1. Rows: whole data stream; sliding window; sliding window with out-of-order streams.
Columns: Basic Counting (bits); Approximate Counting / Frequent items (words); Quantiles (words).]

Table 1: Bounds on the communication costs. Note that the bounds are stated in bits for
basic counting, and in words for the other problems.

Below we state the lower and upper bounds precisely. Recall that there are k remote
sites and the sliding window contains W time units. We prove that within any window,
the root and the remote sites need to communicate, in the worst case, $\Omega(\frac{k}{\varepsilon}\log\frac{\varepsilon N}{k})$ bits
for basic counting and $\Omega(\frac{k}{\varepsilon}\log\frac{\varepsilon N}{k})$ words for the other three queries, where N is the total
number of items arriving or expiring within that window.¹ For upper bounds, our analysis
shows that basic counting requires $O(\frac{k}{\varepsilon}\log\frac{\varepsilon N}{k})$ bits within any window, and approximate
counting $O(\frac{k}{\varepsilon}\log\frac{N}{k})$ words. The estimates given by approximate counting are sufficient
to find frequent items, hence the latter problem has the same communication cost. For
quantiles, it takes $O(\frac{k}{\varepsilon^2}\log\frac{N}{k})$ words. See the second row (sliding window) of Table 1 for a
summary.

As mentioned before, sliding-window algorithms can be applied to handle the special
case of whole data streams in which the window size W is infinite and N is the total number
of arrived items. The first row of Table 1 shows the results on whole data streams. Our
work has improved the communication cost for basic counting from $O(\frac{k}{\varepsilon}\log\frac{N}{k})$ words [21] to
$O(\frac{k}{\varepsilon}\log\frac{\varepsilon N}{k})$ bits. For approximate counting and frequent items, our work implies a one-way
algorithm with communication cost of $O(\frac{k}{\varepsilon}\log\frac{N}{k})$ words; this matches the performance of
the two-way algorithm by Yi and Zhang [29]. In their algorithm, the root regularly updates
every remote site about the global count of all items. In contrast, we use the idea that
items with small count could be "turned off" for further updating. As a remark, our upper
bound on quantiles is $O(\frac{k}{\varepsilon^2}\log\frac{N}{k})$ words, which is weaker than that of [29].

¹ Note that the number of items arriving or expiring within window [t − W + 1, t] is no greater than the
number of items arriving within [t − 2W + 1, t].

Our algorithms can be readily applied to out-of-order streams (TJ.H0]- In an out-of- 
order stream, each item is associated with an integral time-stamp recording its creation 
time, which may be different from its arrival time. We say that the stream has tardiness
τ if any item with time-stamp t must arrive within τ time units from t, i.e., at any time
in [t, t + τ]. Without loss of generality, we assume that τ ∈ {0, 1, 2, . . . , W − 1} (if an item
time-stamped at t arrives after t + W − 1, it has already expired and can be ignored). Note
that for any data stream with tardiness greater than zero, the items may not be arriving in
non-decreasing order of their time-stamps. Our previous discussion of data streams assumes
tardiness equal to 0, and such data streams are called in-order data streams. The previous
lower bounds for in-order streams are all valid in the out-of-order setting. In addition, we
obtain lower bounds related to τ, namely, $\Omega(\frac{W}{W-\tau})$ bits for basic counting and $\Omega(\frac{W}{W-\tau})$
words for the other three problems. Regarding upper bounds, our algorithms when applied
to out-of-order streams with tardiness τ will just increase the communication cost by a
factor of $\frac{W}{W-\tau}$. The results are summarized in the last row of Table 1.

The idea for basic counting is relatively simple. As the root does not require an exact
total count, each data stream can communicate to the root only when its local count
increases or decreases by a certain ratio ε > 0; we call such a communication step an up or
down event, respectively. To answer the total count of all streams, the root simply sums up
all the individual counts it has received. It is easy to prove that this answer is within some
desired error bound. If each count is over the whole stream (i.e., window size = ∞ and N
is the total number of arrived items), the count is increasing and there is no down event. A
stream would have at most $O(\log_{1+\varepsilon} N)$ up events and the communication cost is at most
that many words. However, the analysis becomes non-trivial in a sliding time window. Now
items can expire and down events can occur. An up event may be followed by some down
events and the count is no longer increasing. The tricky part is to find a new measure of
progress. We identify a "characteristic set" of each up event such that each up event must
increase the size of this set by a factor of at least 1 + ε, hence bounding the number of up
events to be $O(\log_{1+\varepsilon} N)$. Down events are bounded using another characteristic set. Due
to space limitation, the details can only be given in the full paper.
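The up/down idea for basic counting can be sketched as a small simulation. The code below is a minimal illustration under stated assumptions, not the paper's algorithm BC: each site keeps an exact local count (the paper only needs an approximate one), and the exact trigger rule (send when the count leaves the interval [(1 − ε)s, (1 + ε)s] around the last sent value s) is an assumed reading of "increases or decreases by a certain ratio ε". The names `BasicCountingSite` and `simulate` are hypothetical. Under this rule the root's sum is within a factor ε/(1 − ε) of the true total.

```python
from collections import deque

class BasicCountingSite:
    """One remote site with an exact count over the last `window` time units."""

    def __init__(self, window, eps):
        self.window, self.eps = window, eps
        self.timestamps = deque()   # arrival times of items still in the window
        self.last_sent = 0          # last count reported to the root
        self.messages = 0

    def step(self, now, arrivals):
        """Process `arrivals` new items at time `now`; return a count to send, or None."""
        self.timestamps.extend([now] * arrivals)
        # The window is [now - window + 1, now]; drop everything older.
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        count = len(self.timestamps)
        up = count > (1 + self.eps) * self.last_sent
        down = count < (1 - self.eps) * self.last_sent
        if up or down:
            self.last_sent = count
            self.messages += 1
            return count
        return None

def simulate(arrivals_per_site, window, eps):
    """Run all sites in lock step; return (worst relative error at root, messages sent)."""
    sites = [BasicCountingSite(window, eps) for _ in arrivals_per_site]
    root_view = [0] * len(sites)    # last count received from each site
    worst = 0.0
    for now in range(len(arrivals_per_site[0])):
        for i, site in enumerate(sites):
            msg = site.step(now, arrivals_per_site[i][now])
            if msg is not None:
                root_view[i] = msg
        true_total = sum(len(s.timestamps) for s in sites)
        if true_total:
            worst = max(worst, abs(sum(root_view) - true_total) / true_total)
    return worst, sum(s.messages for s in sites)
```

For example, `simulate([[3, 0, 5, 1, 0, 0, 0, 2, 7, 0], [1] * 10], window=4, eps=0.2)` returns the worst relative error seen by the root together with the number of messages, which can be compared against sending one message per time unit.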

Approximate counting of all possible items is much more complicated, and will be
covered in detail in the rest of this paper. Assuming in-order streams, we derive and
analyze two algorithms for approximate counting in Section 2. In Section 3, we discuss
frequent items, quantiles, and finally out-of-order streams. The lower bound results are
relatively simple and omitted due to space limitation.

2. Approximate Counting of all items 

This section presents algorithms for the streams to communicate to the root so that
the root at any time can approximate the count of each item. As a warm-up, we first
consider the simple algorithm in which a stream will inform the root whenever its count
of an item increases or decreases by a certain fraction of its total item count. We show
in Section 2.1 that within any window of W time units, each data stream $\sigma_i$ ($1 \le i \le k$)
needs to send at most $O((\Delta + \frac{1}{\varepsilon})\log n_i)$ words to the root, where Δ is the number of
distinct items and $n_i$ is the number of items of $\sigma_i$ that arrive or expire within the window.
Then, the total communication cost within this window is $\sum_{1\le i\le k}(\Delta + \frac{1}{\varepsilon})\log n_i$, which, by
Jensen's inequality, is no greater than $(\Delta + \frac{1}{\varepsilon})\,k\log\big(\sum_{1\le i\le k} n_i / k\big) = (\Delta + \frac{1}{\varepsilon})\,k\log\frac{N}{k}$, where
$N = \sum_{1\le i\le k} n_i$. We then modify the algorithm so that a stream can "turn off" items whose
counts are too small, and we give a more complicated analysis to deal with the case when
many such items increase their counts rapidly (Section 2.2). The communication cost is
reduced to $O(\frac{k}{\varepsilon}\log\frac{N}{k})$ words, independent of Δ.
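The Jensen step used above can be written out explicitly. The following is a short worked derivation, using only the concavity of the logarithm and the notation $n_i$, $N$, $k$, Δ, ε from the text:

```latex
\begin{align*}
\sum_{i=1}^{k} \Big(\Delta + \frac{1}{\varepsilon}\Big) \log n_i
  &= \Big(\Delta + \frac{1}{\varepsilon}\Big)\, k \cdot \frac{1}{k}\sum_{i=1}^{k} \log n_i \\
  &\le \Big(\Delta + \frac{1}{\varepsilon}\Big)\, k \,\log\Big(\frac{1}{k}\sum_{i=1}^{k} n_i\Big)
     && \text{(Jensen's inequality, $\log$ is concave)} \\
  &= \Big(\Delta + \frac{1}{\varepsilon}\Big)\, k \,\log\frac{N}{k}
     && \text{(since $N = \textstyle\sum_{i=1}^{k} n_i$)}.
\end{align*}
```

Intuitively, for a fixed total N the sum of per-stream costs is largest when the items are spread evenly, so that every $n_i$ equals N/k.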

2.1. A simple algorithm 

Consider any stream σ. At any time t, let c(t) and $c_j(t)$ be the number of all items
and item j arriving at σ in [t − W + 1, t], respectively. Let λ < 1/11 be a positive constant
(which will be set to ε/11). We maintain two λ-approximate data structures [13, 23] at σ
locally, which can report estimates $\hat{c}(t)$ and $\hat{c}_j(t)$ for c(t) and $c_j(t)$, respectively, such that²

$(1 - \lambda/6)\,c(t) \le \hat{c}(t) \le (1 + \lambda/6)\,c(t)$; and $c_j(t) - \lambda c(t) \le \hat{c}_j(t) \le c_j(t) + \lambda c(t)$.



Simple algorithm. At any time t, for any item j, let p < t be the last time
$\hat{c}_j(p)$ is sent to the root. The stream sends the estimate $(j, \hat{c}_j(t))$ to the root if
one of the following events occurs.

• Up: $\hat{c}_j(t) > \hat{c}_j(p) + 9\lambda\hat{c}(t)$.

• Down: $\hat{c}_j(t) < \hat{c}_j(p) - 9\lambda\hat{c}(t)$.
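The up and down rules can be sketched as follows. This is a minimal simulation under stated assumptions, not the paper's implementation: exact counters stand in for the λ-approximate data structures, an item that has never been reported is treated as having last estimate 0, and the names `SimpleSite` and `run` are hypothetical. With exact counters, the root's estimate of each item is off by at most 9λ times the total window count.

```python
from collections import Counter, deque

class SimpleSite:
    """One stream running the up/down rule with exact counts in place of estimates."""

    def __init__(self, window, lam):
        self.window, self.lam = window, lam
        self.items = deque()        # (timestamp, item) pairs still inside the window
        self.counts = Counter()     # exact c_j(t), used here instead of an estimate
        self.last_sent = Counter()  # last value reported to the root for each item

    def step(self, now, arrivals):
        """Add `arrivals` at time `now`, expire old items, return the messages to send."""
        for j in arrivals:
            self.items.append((now, j))
            self.counts[j] += 1
        while self.items and self.items[0][0] <= now - self.window:
            _, j = self.items.popleft()
            self.counts[j] -= 1
        threshold = 9 * self.lam * len(self.items)
        messages = []
        for j in set(self.counts) | set(self.last_sent):
            cur, prev = self.counts[j], self.last_sent[j]
            if cur > prev + threshold or cur < prev - threshold:   # up or down event
                self.last_sent[j] = cur
                messages.append((j, cur))
        return messages

def run(streams, window, lam):
    """Return (messages sent, largest amount by which the 9*lam*c error bound is exceeded)."""
    sites = [SimpleSite(window, lam) for _ in streams]
    root = [Counter() for _ in streams]     # root's view: last estimate per site and item
    sent, max_violation = 0, 0.0
    for now in range(len(streams[0])):
        for i, site in enumerate(sites):
            for j, value in site.step(now, streams[i][now]):
                root[i][j] = value
                sent += 1
        total = sum(len(s.items) for s in sites)
        for j in set().union(*(s.counts for s in sites)):
            estimate = sum(r[j] for r in root)
            true = sum(s.counts[j] for s in sites)
            max_violation = max(max_violation, abs(estimate - true) - 9 * lam * total)
    return sent, max_violation
```

A usage example: `run([[["a", "a", "b"], ["a"], [], ["b", "b", "b"]], [["a"], ["c"], ["c"], []]], window=3, lam=0.05)` returns the number of messages and a violation value of 0.0, meaning the error bound held at every time step.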

Root's perspective. At any time t, let $r_{j,\sigma}(t)$ be the last estimate received from a stream σ
for item j (at or before t). The root can estimate the total count of item j over all streams by
summing all $r_{j,\sigma}(t)$ received. More precisely, for any 0 < ε < 1, we set λ = ε/11 and let each
stream use the simple algorithm. Then for each stream σ, the approximate data structures
for $c_j(t)$ and c(t) together with the simple algorithm guarantee that $c_j(t) - 11\lambda c(t) \le
r_{j,\sigma}(t) \le c_j(t) + 11\lambda c(t)$. Summing $r_{j,\sigma}(t)$ over all streams would give the root an estimate
of the total count of item j within an error of ε times the total count of all items.
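One way to see where the constant 11 comes from is the following short derivation. It is a sketch that assumes only the two accuracy guarantees of the local data structures, the fact that no up or down event is pending at time t, and λ < 1/11; here $r_{j,\sigma}(t) = \hat{c}_j(p)$ is the last estimate sent.

```latex
\begin{align*}
|r_{j,\sigma}(t) - c_j(t)|
  &\le |\hat{c}_j(p) - \hat{c}_j(t)| + |\hat{c}_j(t) - c_j(t)| \\
  &\le 9\lambda\,\hat{c}(t) + \lambda\, c(t)
     && \text{(no event at $t$; accuracy of $\hat{c}_j$)} \\
  &\le 9\lambda\Big(1 + \frac{\lambda}{6}\Big) c(t) + \lambda\, c(t)
     && \text{(accuracy of $\hat{c}$)} \\
  &= \Big(10 + \frac{3\lambda}{2}\Big)\lambda\, c(t) \;\le\; 11\lambda\, c(t)
     && \text{(since $\lambda < 1/11$)}.
\end{align*}
```

Summing over all streams gives a total error of at most $11\lambda \sum_\sigma c_\sigma(t) = \varepsilon c$ when λ = ε/11.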
Communication Complexity. At any time t, we denote the reference window as $[t_o, t]$,
where $t_o = t - W + 1$. Let n be the number of items of σ that arrive or expire in $[t_o, t]$.
Assume that there are at most Δ distinct items. We first show that a stream σ encounters
$O((\frac{1}{\lambda} + \Delta)\log n)$ up events and sends $O((\frac{1}{\lambda} + \Delta)\log n)$ words within $[t_o, t]$. The analysis of
down events is similar and will be detailed later. For any time $t_1 \le t_2$, it is useful to define
$\sigma_{[t_1,t_2]}$ (resp. $\sigma_{j,[t_1,t_2]}$) as the multi-set of all items (resp. item j only) arriving at σ within
$[t_1, t_2]$, and $|\sigma_{[t_1,t_2]}|$ as the size of this multi-set.

Consider an up event $U_j$ of some item j that occurs at time $v \in [t_o, t]$. Define the
previous event of $U_j$ to be the latest event (up or down) of item j that occurs at time p < v.
We call p the previous-event time of $U_j$. The number of up events with previous-event
time before $t_o$ is at most Δ. To upper bound the number of up events with previous-event
time $p \ge t_o$ is, however, non-trivial; below we call such an up event a follow-up (event).
Intuitively, a follow-up can be triggered by frequent arrivals of an item, or mainly the
relative decrease of the total count. This motivates us to classify follow-ups into two types
and analyze them differently. A follow-up Uj is said to be absolute if c(p) < |c(v), and 

relative otherwise. Define Recent-items($U_j$) to be the multi-set of item j's that arrive after
the previous event of $U_j$, i.e., Recent-items($U_j$) = $\sigma_{j,[p+1,v]}$.

² The constant 6 in the inequality is arbitrary. It can be replaced with any number provided that
other constants in the algorithm and analysis (e.g., the constant 9 in the definition of up events) are adjusted
accordingly.

Absolute follow-ups. To obtain a tight bound of absolute follow-ups, we need a
characteristic-set argument that can consider the growth of different items together. Let
$t_1, t_2, \ldots, t_k$ be the times in $[t_o, t]$ when some absolute follow-ups (of one or more items)
occur. Let $x_i$ be the number of items having an absolute follow-up at $t_i$. Note that for all
i, $x_i \le \min\{1/(7\lambda), \Delta\}$,³ and $\sum_{i=1}^{k} x_i$ is the number of absolute follow-ups in $[t_o, t]$. We
define the characteristic set $S_i$ at each $t_i$ as follows:

$S_i$ = the union of Recent-items($U_j$) over all absolute follow-ups $U_j$ occurring at $t_1, t_2, \ldots, t_i$.
Recall that n is the number of items of σ that arrive or expire in [t − W + 1, t].

Lemma 2.1. (i) For any $2 \le i \le k$, $|S_i| \ge (1 + 6x_i\lambda)|S_{i-1}|$. (ii) There are $\sum_{i=1}^{k} x_i =
O(\frac{1}{\lambda}\log n)$ absolute follow-ups within $[t_o, t]$.

Proof. For (i), consider an absolute follow-up Uj of an item j, occurring at time ti with 
previous-event time pi. Note that the increase in the count of item j from pi to ti must be 
due to the recent items. We have 

$|\text{Recent-items}(U_j)| \ge c_j(t_i) - c_j(p_i)$

$\ge \hat{c}_j(t_i) - \hat{c}_j(p_i) - \lambda c(t_i) - \lambda c(p_i)$ (by σ's local data structures)

$> 9\lambda\hat{c}(t_i) - \lambda c(t_i) - \lambda c(p_i)$ (definition of an up event)

> (9A(1 - |) - A - §A)c(ij) > 6Ac(ti) (Uj is absolute) 

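There are $x_i$ absolute follow-ups at $t_i$, so $|S_i| \ge |S_{i-1}| + x_i(6\lambda c(t_i))$. Since $S_i \subseteq \sigma_{[t_o,t_i]}$,
$c(t_i) \ge |S_i|$. Therefore, we have $|S_i| \ge |S_{i-1}| + 6x_i\lambda|S_i| \ge (1 + 6x_i\lambda)|S_{i-1}|$.

For (ii), we note that $n \ge |S_k| \ge \prod_{i=2}^{k}(1 + 6x_i\lambda)|S_1|$, and $|S_1| \ge 1$. Thus, $\prod_{i=2}^{k}(1 +
6x_i\lambda) \le n$, or equivalently, $\ln n \ge \sum_{i=2}^{k}\ln(1 + 6x_i\lambda)$. The latter is at least $\sum_{i=2}^{k}\frac{6x_i\lambda}{1+6x_i\lambda} \ge
\frac{42\lambda}{13}\sum_{i=2}^{k} x_i$. The last inequality follows from that $x_i \le 1/(7\lambda)$ for all i. Thus, $\sum_{i=1}^{k} x_i \le
x_1 + \frac{13}{42\lambda}\ln n = O(\frac{1}{\lambda}\log n)$. ■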

Relative follow-ups. A relative follow-up occurs only when a lot of items expire,
and relative follow-ups of the same item cannot occur too frequently. Below we define
O(log n) time intervals and argue that no item can have two relative follow-ups within an
interval. For an item with time-stamp $t_1$, we define the first expiry time to be $t_1 + W$. At
any time u in $[t_o, t]$, define $H_u$ to be the set of all items whose first expiry time is within
[u + 1, t], i.e., $H_u = \sigma_{[u-W+1,\,t_o-1]}$. $|H_u|$ is non-increasing as u increases. Consider the
times t Q = uq < ui < U2 < • • • < w < t such that for i > 1, m is the first time such that 
|.ffml ^ fl-f^Mi-il- For convenience, let ue + \ = t + l. Note that \H Uo \ < n and i = O(logn). 

Lemma 2.2. (i) Every item j has at most one relative follow-up $U_j$ within each interval
$[u_i, u_{i+1} - 1]$. (ii) There are at most $O(\Delta\log n)$ relative follow-ups within $[t_o, t]$.

Proof. For (i), assume Uj occurs at time v in [uj, Ui + \ — 1], and its previous event occurs at 
time p. By definition, c(p) > |c(u). Thus, 

\H P \ - \H V \ = \a [p _ w+l ^ v _ w] \ > c(p) - c{v) > \c(v) > %\afr. w+ltte _ 1] \ = l\H v \ , 

and \H V \ < §|-Hp|. Since v < Uj+i and \H V \ > ^\H Ui \, we have \H P \ > \H Ui \ and p < itj.
For (ii), there are Δ distinct items, so there are at most Δ relative follow-ups within each
interval $[u_i, u_{i+1} - 1]$, and at most $O(\Delta\log n)$ relative follow-ups within $[t_o, t]$. ■

³ If an up event of an item j occurs at time $t_i$, then $c_j(t_i) \ge \hat{c}_j(t_i) - \lambda c(t_i) > 9\lambda\hat{c}(t_i) - \lambda c(t_i) \ge 7\lambda c(t_i)$.
Thus the number of up events at time $t_i$ is at most $c(t_i)/(7\lambda c(t_i)) = 1/(7\lambda)$.

Down events. The analysis is symmetric to that of up events. The only non-trivial 
thing is the definition of the characteristic set for bounding the absolute follow-downs Dj, 
which is defined in an opposite sense: Assume Dj occurs at time v and its previous event 
occurs at p > t Q . Dj is said to be absolute if c(p) < §c(i>). Let Expire(Dj) be the multi-set 
of item j's whose first expiry time is within [p + 1, v], i.e., Expire($D_j$) = $\sigma_{j,[p-W+1,\,v-W]}$.

It is perhaps a bit tricky that instead of defining the characteristic set of absolute 
follow-downs at the time they occur, we consider the times of the corresponding previous 
events of these follow-downs. Let $p_1, p_2, \ldots, p_k$ be the times in $[t_o, t]$ such that there is at
least one event $E_j$ (up or down) at $p_i$ which is the previous event of an absolute follow-down
$D_j$ occurring after $p_i$. Let $y_i$ be the number of such previous events at $p_i$, and let AD($p_i$)
be the set of corresponding absolute follow-downs. Note that $y_i$ (unlike $x_i$) only admits a
trivial upper bound of Δ. We define the characteristic set $T_i$ for each $p_i$ as follows:

$T_i$ = the union of Expire($D_j$) over all $D_j \in$ AD($p_i$), AD($p_{i+1}$), . . . , AD($p_k$).
Similar to Lemma 2.1, we can show that $|T_i| \ge (1 + 5y_i\lambda)|T_{i+1}|$. Owing to a weaker bound
of individual $y_i$, the number of absolute follow-downs, which equals $\sum_{i=1}^{k} y_i$, is shown to be
$O((\frac{1}{\lambda} + \Delta)\log n)$.

Combining the analyses on up and down events, and letting λ = ε/11, we have the following.

Theorem 2.3. The simple algorithm sends at most $O((\frac{1}{\varepsilon} + \Delta)\log n)$ words to the root
during window [t − W + 1, t].

2.2. The full algorithm 

In this section, we extend the previous algorithm and give a new characteristic-set
analysis that is based on future events (instead of the past events) to show that each
stream's communication cost per window can be reduced to $O(\frac{1}{\varepsilon}\log n)$ words. Then, by
Jensen's inequality again, we conclude that the total communication cost per window is
$O(\frac{k}{\varepsilon}\log\frac{N}{k})$. Intuitively, when the estimate $\hat{c}_j(t)$ of an item j is too small, say, less than
$3\lambda\hat{c}(t)$, the algorithm treats this estimate as 0 and sets the $\text{off}_j$ flag of j to be true. This
restricts the number of items with a positive estimate to $O(\frac{1}{\lambda})$. Initially, the $\text{off}_j$ flag is true
for all items j. Given $0 < \lambda \le \varepsilon/11$, the stream communicates with the root as follows.

Algorithm AC. At any time t, for any item j, let p < t be the time the last
estimate of j, i.e., $\hat{c}_j(p)$, is sent to the root. The stream sends the estimate of j to
the root if one of the following events occurs.

• Up: If $\hat{c}_j(t) > \hat{c}_j(p) + 9\lambda\hat{c}(t)$, send $(j, \hat{c}_j(t))$ and set $\text{off}_j$ = false.

• Off: If $\text{off}_j$ = false and $\hat{c}_j(t) < 3\lambda\hat{c}(t)$, reset $\hat{c}_j(t)$ to 0, send $(j, \hat{c}_j(t))$
and set $\text{off}_j$ = true.

• Down: If $\text{off}_j$ = false and $\hat{c}_j(t) < \hat{c}_j(p) - 9\lambda\hat{c}(t)$, send $(j, \hat{c}_j(t))$.
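The three events of Algorithm AC can be sketched in the same style. This is a hedged illustration rather than the paper's implementation: exact counters replace the λ-approximate data structures, the rules are checked in the order Up, Off, Down (an assumption, since the text does not fix a priority when Off and Down both apply), and the names `ACSite` and `run_ac` are hypothetical. An item is "on" exactly when it has an entry in `last_sent`.

```python
from collections import Counter, deque

class ACSite:
    """One stream running the Up/Off/Down rules with exact counts as estimates."""

    def __init__(self, window, lam):
        self.window, self.lam = window, lam
        self.items = deque()      # (timestamp, item) pairs still inside the window
        self.counts = Counter()   # exact c_j(t), standing in for the estimate
        self.last_sent = {}       # item -> last estimate sent; present only if off_j is false

    def step(self, now, arrivals):
        """Add `arrivals` at time `now`, expire old items, return the messages to send."""
        for j in arrivals:
            self.items.append((now, j))
            self.counts[j] += 1
        while self.items and self.items[0][0] <= now - self.window:
            _, j = self.items.popleft()
            self.counts[j] -= 1
        total = len(self.items)
        messages = []
        for j in set(self.counts) | set(self.last_sent):
            cur = self.counts[j]
            on = j in self.last_sent
            prev = self.last_sent.get(j, 0)          # an off item has estimate 0
            if cur > prev + 9 * self.lam * total:    # Up: report and switch on
                self.last_sent[j] = cur
                messages.append((j, cur))
            elif on and cur < 3 * self.lam * total:  # Off: report 0 and switch off
                del self.last_sent[j]
                messages.append((j, 0))
            elif on and cur < prev - 9 * self.lam * total:   # Down
                self.last_sent[j] = cur
                messages.append((j, cur))
        return messages

def run_ac(stream, window, lam):
    """Return (max number of 'on' items, largest excess over the 9*lam*c error bound)."""
    site, root = ACSite(window, lam), {}
    max_on, worst = 0, 0.0
    for now, arrivals in enumerate(stream):
        for j, value in site.step(now, arrivals):
            root[j] = value
        total = len(site.items)
        max_on = max(max_on, len(site.last_sent))
        for j, true in site.counts.items():
            worst = max(worst, abs(root.get(j, 0) - true) - 9 * lam * total)
    return max_on, worst
```

In a stream with one heavy item and many items that each appear once, the rare items never trigger an up event, so only the heavy item costs any communication; this is the effect the off flag is designed to achieve.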

It is straightforward to check that the root can answer the approximate counting query
for any item. We analyze the communication complexity of different events as follows.
Fact 1. At any time v, the number of items j with $\text{off}_j$ = false is at most $\frac{1}{\lambda}$.⁴

Off events. Recall that we are considering the window $[t_o, t]$, and n is the number of
items arriving or expiring within $[t_o, t]$. By Fact 1, just before $t_o$, there are at most $\frac{1}{\lambda}$ items
with $\text{off}_j$ = false. Within $[t_o, t]$, only an up event can set the off flag to false. Thus the
number of off events within $[t_o, t]$ is bounded by $\frac{1}{\lambda}$ plus the number of up events.
Up and Down events. The assumption of Δ gives a trivial bound on those events
involving items with very small counts and, in particular, those up events immediately
following the off events. Such up events are called poor-up events or simply poor-ups. Using
the off flag, we can easily adapt the analysis of the simple algorithm to bound all the
down and up events of the full algorithm, except the poor-ups. The following simple
observations, derived from Fact 1, allow us to replace Δ with 1/λ in the previous analysis
to obtain a tighter upper bound of $O(\frac{1}{\lambda}\log n)$. Let v be any time in $[t_o, t]$.

• There are at most 1/λ items whose first event after v is a down event.

• There are at most 1/λ non-poor-up events after v whose previous event is before v.

It remains to analyze the poor-ups. Consider a poor-up $U_j$ at time v in $[t_o, t]$. By
definition, $\text{off}_j$ = false at time v. The trick of analyzing the $U_j$'s is to consider when the
corresponding items will be "off" again instead of what items constitute the up events.
Then a characteristic set argument can be formulated easily. Specifically, we first observe
that, by Fact 1, there are at most $\frac{1}{\lambda}$ poor-ups whose off flags remain false up to time t.
Then it remains to consider those $U_j$ whose off flags will be set to true at some time $d \le t$.
Below we refer to d as the first off time of $U_j$.

Poor-up with early off. Consider a poor-up $U_j$ that occurs at time v in $[t_o, t]$ and has
its first off time at d in [v + 1, t]. Let F-Expire($U_j$) be all the item j whose first expiry time
is within [v + 1, d], i.e., F-Expire($U_j$) = $\sigma_{j,[v+1-W,\,d-W]}$. As an early off can be due to the
expiry of many copies of item j or the arrival of a lot of items, it is natural to divide the
poor-ups into two types: with an absolute off if $c(d) \le \frac{6}{5}c(v)$, and relative off otherwise. For
the case with absolute off, we consider the distinct times $t_1, t_2, \ldots, t_k$ in $[t_o, t]$ when such
poor-ups occur. Let $x_i$ be the number of such poor-ups at time $t_i$. Note that $x_i \le 1/(7\lambda)$.
For each time $t_i$, we define the characteristic set

$F_i$ = the union of F-Expire($U_j$) over all $U_j$ occurring at $t_i, t_{i+1}, \ldots, t_k$.

Lemma 2.4. (i) For any $1 \le i \le k - 1$, $|F_i| \ge (1 + x_i\lambda)|F_{i+1}|$. (ii) Within $[t_o, t]$, there are
$\sum_{i=1}^{k} x_i = O(\frac{1}{\lambda}\log n)$ poor-ups each with an absolute off.

Proof. For (i), consider an item j and a poor-up $U_j$ with an absolute off that occurs at time
$t_i$ and has its first off at time $d_i$. The decrease in $c_j$ must be due to expiry of item j.

$|\text{F-Expire}(U_j)| \ge c_j(t_i) - c_j(d_i) \ge \hat{c}_j(t_i) - \hat{c}_j(d_i) - \lambda c(t_i) - \lambda c(d_i)$

$\ge 9\lambda\hat{c}(t_i) - 3\lambda\hat{c}(d_i) - \lambda c(t_i) - \lambda c(d_i)$ (definition of up and off)

$\ge (9\lambda(1 - \frac{\lambda}{6}) - \lambda)\,c(t_i) - (3\lambda(1 + \frac{\lambda}{6}) + \lambda)\,c(d_i) \ge 7\lambda c(t_i) - 5\lambda c(d_i)$

$\ge (7 - 5(\frac{6}{5}))\lambda c(t_i) = \lambda c(t_i)$ (definition of absolute off)

Thus, $|F_i| \ge |F_{i+1}| + x_i(\lambda c(t_i))$. Since $F_i \subseteq \sigma_{[t_i-W+1,\,t-W]}$, $|F_i| \le c(t_i)$. Therefore, $|F_i| \ge
|F_{i+1}| + x_i\lambda|F_i| \ge (1 + x_i\lambda)|F_{i+1}|$. By (i), we can prove (ii) similarly to Lemma 2.1 (ii). ■



Analyzing poor-ups with a relative off is again based on an isolating argument. We
divide $[t_o, t]$ into O(log n) intervals according to how fast the total item count starting
from $t_o$ grows; specifically, we want two consecutive time boundaries $u_{i-1}$ and $u_i$ to satisfy
$|\sigma_{[t_o,u_i]}| > \frac{6}{5}|\sigma_{[t_o,u_{i-1}]}|$. Then we show that for any poor-up within $[u_{i-1}, u_i - 1]$, its relative
off, if it exists, occurs at or after $u_i$. Thus there are at most $\frac{1}{\lambda}$ such poor-ups within each
interval and a total of $O(\frac{1}{\lambda}\log n)$ within $[t_o, t]$.

⁴ For any item j, if $\text{off}_j$ = false, then $\hat{c}_j(v) \ge 3\lambda\hat{c}(v)$ and $c_j(v) \ge \hat{c}_j(v) - \lambda c(v) \ge (3\lambda(1 - \lambda) - \lambda)c(v) \ge \lambda c(v)$.
Thus the number of items j with $\text{off}_j$ = false is at most $c(v)/(\lambda c(v)) = \frac{1}{\lambda}$.

Lemma 2.5. (i) Consider a poor-up Uj with a relative off. Suppose it occurs at time v in 
[t a ,t], and its first off time is at d in [v + l,t]. Then \au 0) s\ > § |<7[t a;V ]|. (ii) Within [t ,t], 
there are at most O(^logn) poor-ups each with a relative off. 

Proof. For (i), by the definition of a relative off, c{d) > |c(w). Thus, | I — l <T [t OI i;]l = 

|o>+mI - c ( d ) ~ c ^ > - \\ a \toA\- This im P lies W[t ,d] \ > |l°"[t ,«]l- 

For (ii), consider the times $t_0 = u_0 < u_1 < u_2 < \cdots < u_\ell \le t$ such that for $i \ge 1$,
$u_i$ is the first time such that $|\sigma_{[t_0,u_i]}| \ge \frac{6}{5}|\sigma_{[t_0,u_{i-1}]}|$. For convenience, let $u_{\ell+1} = t + 1$.
Note that $|\sigma_{[t_0,t]}| \le n$ and $\ell = O(\log n)$. Furthermore, for any time $v \in [u_{i-1}, u_i - 1]$,
$|\sigma_{[t_0,v]}| < \frac{6}{5}|\sigma_{[t_0,u_{i-1}]}|$. Therefore, by (i), for any poor-up of an item $j$ within $[u_{i-1}, u_i - 1]$, its
relative off, if it exists, occurs at or after $u_i$, which implies that at time $u_i - 1$, $c_j(u_i - 1) > \lambda c(u_i - 1)$.
Then within each interval $[u_{i-1}, u_i - 1]$, the number of such $j$, as well as the number of poor-ups
with a relative off, is at most $\frac{1}{\lambda}$. Within $[t_0, t]$, there are $O(\log n)$ intervals and
hence $O(\frac{1}{\lambda}\log n)$ poor-ups each with a relative off. ■
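The interval construction used in the proof of (ii) can be sketched as follows. This is only an illustration under stated assumptions: the growth factor is read as 6/5, the input `arrivals` (number of items arriving at each time step from $t_0$ onward) is a hypothetical representation, and at least one item is assumed to arrive at $t_0$ so that the first prefix count is positive.

```python
from itertools import accumulate


def growth_boundaries(arrivals):
    """Return indices u_0 < u_1 < ... (offsets from t_0) of the boundaries.

    arrivals[i] is the assumed number of items arriving at time t_0 + i.
    u_i is the first time whose prefix count is at least 6/5 times the
    prefix count at u_{i-1}; integer arithmetic avoids rounding issues.
    """
    if not arrivals or arrivals[0] < 1:
        raise ValueError("sketch assumes at least one arrival at time t_0")
    prefix = list(accumulate(arrivals))    # prefix[i] = |sigma[t_0, t_0 + i]|
    boundaries = [0]                       # u_0 = t_0
    base = prefix[0]
    for i in range(1, len(prefix)):
        if 5 * prefix[i] >= 6 * base:      # prefix[i] >= (6/5) * base
            boundaries.append(i)
            base = prefix[i]
    return boundaries


# One arrival per time step over 20 steps.
steady = growth_boundaries([1] * 20)
```

Since each boundary multiplies the prefix count by at least 6/5 and the count never exceeds $n$, the number of boundaries after $u_0$ is at most $\log_{6/5} n$, which is the $O(\log n)$ bound used in the proof.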

Theorem 2.6. For approximate counting, each individual stream can use the algorithm AC
with $\lambda = \epsilon/11$ and it sends at most $O(\frac{1}{\epsilon}\log n)$ words to the root within a window.

Memory usage of each remote site. Recall that we use two $\lambda$-approximate data
structures [T31E3] for the total item count and individual item counts, which respectively
require $O(\frac{1}{\lambda}\log^2(\lambda n))$ bits and $O(\frac{1}{\lambda})$ words. Note that $O(\frac{1}{\lambda}\log^2(\lambda n))$ bits is equivalent to
$O(\frac{1}{\lambda}\log(\lambda n))$ words. Furthermore, at any time, we only need to keep track of the last
estimate sent to the root for every item $j$ with $\mathit{off}_j = \text{false}$, which by Fact 1 requires $O(\frac{1}{\lambda})$
words. By setting $\lambda = \epsilon/11$ (see Theorem 2.6), the total memory usage of a remote site is
$O(\frac{1}{\lambda}\log(\lambda n)) = O(\frac{1}{\epsilon}\log(\epsilon n))$ words.

3. Extensions 

We extend the previous techniques to solve the problems of frequent items and quantiles 
and handle out-of-order streams. Below BC refers to our algorithm for basic counting. 

Frequent items. Using the algorithms BC and AC, the root can answer the $\epsilon$-approximate
frequent items as follows. Each stream $\sigma$ communicates with the root using
BC with error parameter $\epsilon/24$ and AC with error parameter $11\epsilon/24$. At any time $t$, let
$r_\sigma(t)$ and $r_{j,\sigma}(t)$ be the latest estimates of the numbers of all items and item $j$, respectively,
received by the root from $\sigma$. To answer a query of frequent items with threshold $\phi \in (0, 1]$
at time $t$, the root can return all items $j$ with $\sum_\sigma r_{j,\sigma}(t) \ge (\phi - \frac{\epsilon}{2})\sum_\sigma r_\sigma(t)$ as the set of
frequent items.
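The root's query rule can be sketched as below, taking the threshold to be $(\phi - \epsilon/2)\sum_\sigma r_\sigma(t)$. The dictionary layout and the name `frequent_items` are assumptions made for illustration; how the estimates reach the root (via BC and AC) is not modelled.

```python
from fractions import Fraction


def frequent_items(total_estimates, item_estimates, phi, eps):
    """Return items j whose summed estimate meets (phi - eps/2) * sum of totals.

    total_estimates: {stream: r_sigma(t)}, latest total-count estimates.
    item_estimates:  {stream: {item: r_{j,sigma}(t)}}, latest per-item estimates.
    phi, eps: query threshold and error bound (Fractions keep the test exact).
    """
    total = sum(total_estimates.values())
    threshold = (phi - eps / 2) * total
    sums = {}
    for per_item in item_estimates.values():
        for item, estimate in per_item.items():
            sums[item] = sums.get(item, 0) + estimate
    return sorted(j for j, s in sums.items() if s >= threshold)


# Two hypothetical streams; the threshold is (1/5 - 1/20) * 150 = 22.5.
totals = {"s1": 100, "s2": 50}
per_item = {"s1": {"a": 20, "b": 5, "c": 22}, "s2": {"a": 10, "b": 3}}
answer = frequent_items(totals, per_item, Fraction(1, 5), Fraction(1, 10))
```

In this example only item `a` (summed estimate 30) reaches the threshold; item `c` (22) falls just short.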

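To see the correctness, let $c_\sigma(t)$ and $c_{j,\sigma}(t)$ be the number of all items and item $j$ in $\sigma$
at time $t$, respectively. Algorithm BC guarantees $|r_\sigma(t) - c_\sigma(t)| \le \frac{\epsilon}{24}c_\sigma(t)$, and algorithm
AC guarantees $|r_{j,\sigma}(t) - c_{j,\sigma}(t)| \le \frac{11\epsilon}{24}c_\sigma(t)$. Therefore, if an item $j$ is returned by the
root, then $\sum_\sigma c_{j,\sigma}(t) \ge \sum_\sigma r_{j,\sigma}(t) - \frac{11\epsilon}{24}\sum_\sigma c_\sigma(t) \ge (\phi - \frac{\epsilon}{2})\sum_\sigma r_\sigma(t) - \frac{11\epsilon}{24}\sum_\sigma c_\sigma(t) \ge$
$(\phi - \frac{\epsilon}{2})(1 - \frac{\epsilon}{24})\sum_\sigma c_\sigma(t) - \frac{11\epsilon}{24}\sum_\sigma c_\sigma(t) \ge (\phi - \frac{\epsilon}{2} - \frac{\epsilon\phi}{24} - \frac{11\epsilon}{24})\sum_\sigma c_\sigma(t)$, where the second
inequality comes from the definition of the algorithm. The last term above is at least
$(\phi - \epsilon)\sum_\sigma c_\sigma(t)$, so $j$ is a frequent item. If an item $j$ is not returned by the root, then
$\sum_\sigma r_{j,\sigma}(t) < (\phi - \frac{\epsilon}{2})\sum_\sigma r_\sigma(t)$ and we can show similarly that $\sum_\sigma c_{j,\sigma}(t) < \phi\sum_\sigma c_\sigma(t)$.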

Quantiles. We give an algorithm for $\epsilon$-approximate quantile queries. Let $\lambda = \epsilon/20$.
For each stream, we keep track of the $\lambda$-approximate $\phi$-quantiles for $\phi = 5\lambda, 10\lambda, 15\lambda, \ldots, 1$.
We update the root for all these $\phi$-quantiles when one of the following two events occurs:
(i) for some $k$, the value of the $(5k\lambda)$-quantile is larger than the value of the $(5(k+1)\lambda)$-quantile
last reported to the root, or (ii) for some $k$, the value of the $(5k\lambda)$-quantile is
smaller than the value of the $(5(k-1)\lambda)$-quantile last reported to the root. The stream also
communicates with the root using BC with error parameter $\lambda$. From the root's perspective,
at any query time $t$, let $\phi \in (0, 1]$ be the query given and let $r_\sigma(t)$ be the last estimate
sent by $\sigma$ for the number of all items. The root sorts the quantiles last reported by all
streams and, for each stream $\sigma$, gives a weight of $5\lambda r_\sigma(t)$ to each quantile of $\sigma$. Then the
root returns the smallest item $j$ in the sorted sequence such that the sum of weights for all
items no greater than $j$ is at least $\lceil\phi\sum_\sigma r_\sigma(t)\rceil$. Careful counting can show that $j$ is an
$\epsilon$-approximate $\phi$-quantile. To bound the communication cost, let $n$ be the number of items
of $\sigma$ arriving or expiring during the window $[t - W + 1, t]$. We observe that when an event
occurs, many items have either arrived or expired after the previous event. Using similar
analysis as before, we can show that within a window, there are at most $O(\frac{1}{\epsilon}\log n)$ such
events and thus each stream sends $O(\frac{1}{\epsilon^2}\log n)$ words. By Jensen's inequality again, our
algorithm's total communication cost per window is $O(\frac{k}{\epsilon^2}\log\frac{N}{k})$, where $N$ is the number of
items of the $k$ streams that arrive or expire within the window. Note that the lower bound
of $\Omega(\frac{1}{\epsilon}\log(\epsilon n))$ words for approximate frequent items carries over to approximate quantiles, as
we can answer approximate frequent items using approximate quantiles as follows. The
root poses $\epsilon$-approximate $\phi$-quantile queries for $\phi = \epsilon, 2\epsilon, \ldots, 1$. Given the threshold $\phi'$
for frequent items, the root returns all items that repeatedly occur as $\frac{\phi'}{\epsilon} - 2$ (or more)
consecutive quantiles, and these items are $(4\epsilon)$-approximate frequent items.
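The two moving parts of the quantile scheme, the stream-side update trigger and the root-side weighted selection, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the names `needs_update` and `root_quantile`, the list and dictionary layouts, and the use of a ceiling in the target are assumptions; quantile lists are indexed so that position $k$ holds the $(5(k+1)\lambda)$-quantile.

```python
import math
from fractions import Fraction


def needs_update(current, last_reported):
    """Stream side: True if event (i) or (ii) has occurred.

    current / last_reported: quantile values at phi = 5*lam, 10*lam, ..., 1.
    """
    m = len(current)
    for k in range(m):
        # (i) a current quantile exceeds the next reported quantile
        if k + 1 < m and current[k] > last_reported[k + 1]:
            return True
        # (ii) a current quantile drops below the previous reported quantile
        if k >= 1 and current[k] < last_reported[k - 1]:
            return True
    return False


def root_quantile(reports, phi, lam):
    """Root side: weighted selection over the last reported quantiles.

    reports: {stream: (r_sigma, [reported quantile values])}.
    Each quantile of stream sigma gets weight 5 * lam * r_sigma.
    """
    target = math.ceil(phi * sum(r for r, _ in reports.values()))
    weighted = sorted(
        (value, 5 * lam * r)
        for r, values in reports.values()
        for value in values
    )
    running = 0
    for value, weight in weighted:
        running += weight
        if running >= target:
            return value
    return weighted[-1][0] if weighted else None


# Hypothetical reports with lam = 1/20, so each stream reports 4 quantiles.
reports = {"s1": (40, [10, 20, 30, 40]), "s2": (20, [5, 15, 25, 35])}
median = root_quantile(reports, Fraction(1, 2), Fraction(1, 20))
```

With these reports the target weight for $\phi = 1/2$ is 30, which is first reached at the value 20.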

Out-of-order streams. All our algorithms can be extended to out-of-order streams
with a communication cost increased by a factor of $\frac{W}{W-\tau}$, as follows. Each stream uses
the data structures for out-of-order streams (e.g., 0HO]) to maintain the local estimates.
Then each stream uses our communication algorithms for in-order streams. It is obvious that the
root can answer the corresponding queries. For the communication cost, consider any time
interval $P = [t - (W - \tau) + 1, t]$ of size $W - \tau$. Items arriving in $P$ must have time-stamps in
$[t - W + 1, t]$. Using the same arguments as before, we can show the same communication
cost of each algorithm, but only for a window of size $W - \tau$ instead of $W$. Equivalently, in
any window of size $W$, the communication cost is increased by a factor of $O(\frac{W}{W-\tau})$.
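One way to read the last step: a window of size $W$ can be covered by $\lceil W/(W-\tau) \rceil$ intervals of size $W - \tau$, each incurring the in-order cost, which gives the stated factor. The short sketch below just computes that count; the function name and the integer time units are assumptions for illustration.

```python
def covering_intervals(window, tau):
    """Number of length-(window - tau) intervals needed to cover a window.

    window: window size W; tau: the delay parameter, with 0 <= tau < W.
    Uses integer ceiling division to avoid floating-point rounding.
    """
    if not 0 <= tau < window:
        raise ValueError("sketch assumes 0 <= tau < W")
    return -(-window // (window - tau))


# A one-hour window (in seconds) with delays of up to ten minutes.
pieces = covering_intervals(3600, 600)
```

For $W = 3600$ and $\tau = 600$ two intervals of size 3000 suffice, so the per-window cost at most doubles in this example.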